The CICA Windows Explosion!

home *** CD-ROM | disk | FTP | other *** search

/ The CICA Windows Explosion! / The CICA Windows Explosion! - Disc 2.iso / nt / source.exe / POSIX / GREP / PEP4GR~4.DOC < prev next >

Wrap

Text File | 1991-09-28 | 7KB | 143 lines

# How long a time lies in one little word! -- Shakespeare, Richard II, I, iii # Fine words butter no parsnips. -- Southern proverb [ef]?grep Implementation Changes 'grep' r.e. translation: To buy speed for the novice 'grep' user who deigns not to learn the extended 'egrep' syntax, we translate 'grep' r.e.'s to the 'egrep' superset. It is straightforward enough to surround search patterns meaningful to 'egrep' but not to 'grep'. Odd cases include the -w option, not implemented in standard 'egrep', the defunct -y option, and "tagged expressions", which are done via an exec() of /bin/grep. Tagged exprs, like grep '\(....\).*\1' /usr/dict/words which outputs chaff like "beriberi", "couscous", "hodgepodge", and "lightweight", are weird. The irregularity these exprs lend coupled with a low complexity/utility ratio kept them from being part of 'egrep'. But for this feature, old 'grep' code could be thrown away. 'fgrep' improvement / (partial) unification: In the new release, we trap low-complexity disjunctions such as egrep 'boyer|moore' file or fgrep 'boyer\n moore' file (or with "-f patfile" in place of the pattern) with a method to superimpose the non-terminals within the Boyer/Moore table. When scanning text backwards, other programming tricks short-circuit some tests against the pattern. Sparing further details, which might make for a more formal writeup, it suffices to say that although worst-case complexity here is O(Rn) with string length 'n', and R == r.e. size, average-case for text is still sublinear. E.g. egrep 'silver|orange|purple' # hard-to-rhyme color test in eg.shell looks at ~55000 chars in /usr/dict/words, whereas (three) separate invocations of egrep on the individual color words make the code look at ~40000 bytes per word. Aho/Corrasick's 'fgrep', in contrast, must look at all 200KB in the dictionary. The elegant "trie" construction within "fgrep" excels, however, for medium/large R. An equally ambitious "reverse trie", supplementing our extant "alternation mush", would attenuate worst-case behavior while preserving low R speedup. We save the addition for another day. Since the syntax for [ef]grep is similar, we thought of making egrep hand off to fgrep for sufficiently large metacharacter-free R, as there is no strong reason to make the user conscious of the separate algorithms. Certain technicalities prevent this. For one, we are not willing to invent an 'egrep' option to inform the code to interpret a file of (say a hundred) word alternatives containing some innocent metacharacter, that it is literal 'fgrep' input, rather than a closure-containing 'egrep' pattern which would otherwise make egrep explode. More work could be done here. Our motivation? Is this not all overblown? Perhaps, but now you can build a simple fast "NSA filter", or search for the seven dwarfs at leisure. Besides, the final nail needed to be driven into 'bm/match', which tried to do parallel match, but actually shuffled things out of order during its simplistic block-based scheme. These programs, part of source archive also, are now historical curiosities. Kanji egrep: Copious notes are in README.kanji.mods. The March 1987 Unix Review was indispensable for pointing out the embedded "SS2" Katakana pitfalls. The modularity of our code as a semi-dependent filter was necessary for this exploration, as we have no access to AT&T/Unisoft Kanji code. Again, JUNET or Sigma project people -- please respond with grep war stories or usage notes. Worst-case r.e. handling: The first code release elaborately called upon a function mcilroy() to pipe partial match output to old egrep for tough expressions, whose backtracking might swamp regexp(). Some details of the DFA/NDFA tradeoff were discussed in pep4grep.doc[12]. Due largely to feedback from John Gilmore of the GNU project, the strategy was revised. egrep.c function kernighan() ("let others do the hard part") now usurps calls to costly popen() by invoking exec() on old egrep when necessary. Rough partial match statistics gathered on the fly determine the handoff. You may revise the time reported previously for egrep 'hoe.*g' /usr/dict/words from 1.2 user, 1.1 sys seconds (VAX 11/750, 4.3BSD, Fuji disks) to 0.8u, 0.4s. For those public-spirited souls who really want to build a PD egrep out of what we offer, sans fallback from regexp() to an AT&T /usr/bin/egrep, the slippery test "egrep 'g+h+o' /usr/dict/words" will prove enlightening. Faster -n option: By popular demand. Though Boyer/Moore techniques subvert line numbering, we've made it faster with brute force (loop unrolling helps VAXEN, but not CRISPS). Timing tests for this and other options appear in the eg.shell script. Not so fast: -v, -b, -w, various r.e.'s with no rexexp() "residue" (you'll still have to use the layered "grep host /etc/hosts | grep -w host" for speed.) Other contra-indications for new [ef]?grep: Monster patterns The amazing expressions output by /usr/lib/calendar still beg for the lazy evaluation technique rolled into edition 8 egrep by Prof. Aho of Princeton. Hinted at on p. 105 in Kernighan & Pike, lazy evaluation reduces standard egrep r.e. compile time. Here the possible O(R**2) machine construction cost is eliminated to amortize complexity at run-time and shifted to such only if a bad match actually happens. Whew! Fortunately, this is not so important for simple r.e. fare, where H. Spencer's regexp() works well, but it certainly helps calendar(1). The catch with lazy eval. is that it slows down simple matching (15-20% for /usr/dict/words on VAX), so it hasn't been adopted by System V egrep. Note that our egrep, deferring to the underlying one in /usr/bin, doesn't care much about these hideous beasts -- it just doesn't do better on them. However, [ef]?grep does well by the Kadhafi matcher (eg.shell, again). Long lines, small alphabets Finally, a comment on one rapidly burgeoning application area where new egrep should not be blindly proscribed -- genome sequencing. Though line limits have been raised (to 8192 byte buffers), much of GENBANK has no newlines. The code would need modification for scanning freestyle. Also, locating ACGT sequences with the current "superimposed BMG" over a 4-letter alphabet might actually be worse, but the global homology crowd probably uses a >20 letter protein alphabet (for other reasons). At any rate, genetic string search generally relies on more sophisticated methods such as dynamic programming ala Sankoff/Kruskal. On the other hand, large alphabets such as Kanji probably help performance. As do parallel transfer disks, MACH file mapping, ... Your suggestions welcome. James A. Woods (ames!jaw) NASA Ames Research Center P.S. Preserving author credit, [ef]?grep may be redistributed as you wish.